# Lecture 27: Embedded Systems Safety

Seyed-Hosein Attarzadeh-Niaki

Based on Slides by Philip Koopman

Embedded Real-Time Systems


### **Outline**

- Critical Systems
- Embedded Software Safety Overview
- Safety Plan & Safety Standards
- Safety Requirements
- Single Points of Failure
- Isolation Mechanisms
- Safety Architectural Patterns


#### **CRITICAL SYSTEMS**


### **Critical Systems**

- Critical systems require low failure rates
- **SIL** = Safety Integrity Level
  - Higher level of integrity needed for higher risk
- Safety critical: Loss of life, injury, environmental damage
  - Special care must be taken to avoid deaths
- Mission critical: Brand tarnish, financial loss, company failure
  - Consider a safety critical approach


### What Is The Worst Case Failure?

- Worst case might not be obvious
  - Aircraft: software can cause a crash
  - Thermostats/HVAC: software can cause frozen plumbing
    - Can also (rarely) kill small children due to overheating
- Key thought experiment
  - What's the worst that can happen if ...
    - ... your system intentionally tried to cause harm?
  - ... this identifies system hazards to mitigate
- Failure consequences vary; typical categories:
  - Multiple fatalities (e.g., plane crash)
  - Single fatality (e.g., single-vehicle car crash)
  - Severe injuries
  - Minor injuries
  - Analogous categories can be defined for mission-critical systems






### Safety Integrity Level (SIL)

- SIL represents
  - The risk presented by a system-level hazard
  - The engineering rigor applied to mitigate the risk
  - The permissible residual probability after mitigation

#### Example: DO-178 (aviation flight hours)

- DAL A (Catastrophic): 10<sup>9</sup> hrs/failure = 114077 years
- DAL B (Hazardous): 10<sup>7</sup> hrs/failure = 1141 years
- DAL C (Major): 10<sup>5</sup> hrs/failure = 11 years
- DAL D (Minor): 10<sup>3</sup> hrs/failure = 42 days
- (DAL: Design Assurance Level)



#### Example: **IEC 61508 (industrial controls)**

- SIL 4: 10<sup>8</sup> hrs/dangerous failure = 11408 years
- SIL 3: 10<sup>7</sup> hrs/dangerous failure = 1141 years
- SIL 2: 10<sup>6</sup> hrs/dangerous failure = 114 years
- SIL 1: 10<sup>5</sup> hrs/dangerous failure = 11 years
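
The year figures above follow from dividing the hours-per-failure target by the number of hours in a year. A minimal sketch of that arithmetic, assuming an average year of 365.25 days (the helper name `hours_to_years` is only illustrative):

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, averaging in leap years


def hours_to_years(hours: float) -> float:
    """Convert an hours-per-failure target into years of continuous operation."""
    return hours / HOURS_PER_YEAR


# Hours-per-failure targets quoted on this slide.
targets = {
    "DAL A / 10^9": 1e9,
    "SIL 4 / 10^8": 1e8,
    "DAL B, SIL 3 / 10^7": 1e7,
    "SIL 2 / 10^6": 1e6,
    "DAL C, SIL 1 / 10^5": 1e5,
}

for name, hours in targets.items():
    print(f"{name}: about {hours_to_years(hours):,.0f} years")

# DAL D is small enough that days are the more natural unit.
print(f"DAL D / 10^3: about {1e3 / 24:.0f} days")
```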


### Higher SIL Invokes More Engineering Rigor

Example: IEC 61508

- HR = Highly Recommended
- R = Recommended
- NR = Not Recommended (don't do this)

SIL 1: lowest integrity level (low risk)

SIL 4: highest integrity level (unacceptable risk)

|    | Technique/Measure*                                                                         | Ref          | SIL1 | SIL2 | SIL3 | SIL4 |
|----|--------------------------------------------------------------------------------------------|--------------|------|------|------|------|
| 1  | Fault detection and diagnosis                                                              | C.3.1        |      | R    | HR   | HR   |
| 2  | Error detecting and correcting codes                                                       | C.3.2        | R    | R    | R    | HR   |
| 3a | Failure assertion programming                                                              | C.3.3        | R    | R    | R    | HR   |
| 3b | Safety bag techniques                                                                      | C.3.4        |      | R    | R    | R    |
| 3c | Diverse programming                                                                        | C.3.5        | R    | R    | R    | HR   |
| 3d | Recovery block                                                                             | C.3.6        | R    | R    | R    | R    |
| 3e | Backward recovery                                                                          | C.3.7        | R    | R    | R    | R    |
| 3f | Forward recovery                                                                           | C.3.8        | R    | R    | R    | R    |
| 3g | Re-try fault recovery mechanisms                                                           | C.3.9        | R    | R    | R    | HR   |
| 3h | Memorising executed cases                                                                  | C.3.10       |      | R    | R    | HR   |
| 4  | Graceful degradation                                                                       | C.3.11       | R    | R    | HR   | HR   |
| 5  | Artificial intelligence - fault correction                                                 | C.3.12       |      | NR   | NR   | NR   |
| 6  | Dynamic reconfiguration                                                                    | C.3.13       |      | NR   | NR   | NR   |
| 7a | Structured methods including for example, JSD, MASCOT, SADT and Yourdon.                   | C.2.1        | HR   | HR   | HR   | HR   |
| 7b | Semi-formal methods                                                                        | Table B.7    | R    | R    | HR   | HR   |
| 7c | Formal methods including for example, CCS, CSP, HOL, LOTOS, OBJ, temporal logic, VDM and Z | C.2.4        |      | R    | R    | HR   |
| 8  | Computer-aided specification tools                                                         | B.2.4        | R    | R    | HR   | HR   |

[IEC 61508]


## Fleet Exposure & Probability

- Bigger fleets have increased exposure
  - 250 million US vehicles @ 1 hour/day = 2.5 \* 10<sup>8</sup> hrs/day exposure
  - If "unlikely" failures happen every million hours, that's 2.5 \* 10<sup>8</sup> hrs / 10<sup>6</sup> hrs per event
    - -> 250 events every day
  - This is why 10<sup>8</sup> to 10<sup>10</sup> hrs per failure is a typical goal
- Hardware components fail at ~10<sup>5</sup>-10<sup>6</sup> hrs
  - Need two independently failing components to get to 10<sup>9</sup> hours!
    - This motivates redundancy for life-critical applications (SIL 3 & SIL 4)
- For mission-critical systems, consider:
  - Fleet exposure = # units \* operational hours/unit
  - Number of acceptable failures
  - Compute failure rate = failures / hours; pick an appropriate SIL
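
The fleet arithmetic above can be sketched in a few lines. The function names are illustrative, and the mission-critical numbers at the end (10,000 units, 5,000 hours each, at most one failure) are hypothetical values chosen only to show the calculation:

```python
def fleet_exposure_hours(units: float, hours_per_unit: float) -> float:
    """Fleet exposure = number of units * operational hours per unit."""
    return units * hours_per_unit


def expected_events(exposure_hours: float, hours_per_event: float) -> float:
    """Expected number of events for a given exposure and mean hours per event."""
    return exposure_hours / hours_per_event


def required_hours_per_failure(exposure_hours: float, acceptable_failures: float) -> float:
    """Mean hours per failure needed to stay within the acceptable failure count."""
    return exposure_hours / acceptable_failures


# Slide example: 250 million vehicles at 1 hour/day, one failure per million hours.
daily_exposure = fleet_exposure_hours(250e6, 1.0)
print(expected_events(daily_exposure, 1e6))  # 250 events per day, as on the slide

# Hypothetical mission-critical product (assumed numbers, not from the lecture).
lifetime_exposure = fleet_exposure_hours(10_000, 5_000)
print(required_hours_per_failure(lifetime_exposure, 1))
```

The resulting hours-per-failure target can then be compared against the SIL bands of the chosen safety standard.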




### **Best Practices For Critical Systems**

- Characterize worst case failure scenarios
- Assign SIL based on relevant safety standard
- Use engineering rigor for software SIL
- Use redundancy for ultra-low failure rates
- Consider *fleet exposure*, not just single unit

#### Pitfalls

- Software redundancy is difficult, and diversity is usually impracticable
- Designer's intuition about "realistic" faults is usually optimistic
  - At 10<sup>-9</sup>/hr, random chance is a close approximation of a malicious adversary
- Going through the motions is not enough for a SIL-based process


# EMBEDDED SOFTWARE SAFETY OVERVIEW






## Safety Culture: Everyone Is Sure It's Safe!

- Space Shuttle Challenger Mishap
  - January 1986 launch explosion; 7 fatalities
  - Dual O-rings keep hot gases inside solid booster
    - History of sometimes failing if too cold
    - At launch, joint temperature was below freezing
  - Booster team told: "prove launch is unsafe"
    - Should have been: "no launch unless proven safe"
    - Getting lucky is not the same thing as being safe






### Overview of Embedded System Safety

- Safety Topics
  - Safety Plan & Safety Standards
  - Safety Requirements
  - Critical System Design
  - Dependability
  - Single Points of Failure
  - Redundancy Management
  - Isolation Mechanisms
  - Safety Architectural Patterns

- Example: THERAC 25 software-controlled radiation therapy mishaps (1985-1987)
- Pitfall
  - Safety isn't just about whether you think it's safe; it's about whether you can prove it is appropriately safe


#### **SAFETY PLAN**


## Safety Plan: The Big Picture for Safety

#### Safety Plan

- Safety Standard: pick a suitable standard
- Hazards & Risks: hazard log, criticality analysis
- Goals: safety strategy, safety requirements
- Mitigation & Analysis: HAZOP, FMEA, FTA, ETA, reliability, ...
- Safety Case: safety argument






### Safety Goals & Safety Requirements

- Safety Goal: top level definition of "safe"
  - Example: vehicle speed control
    - Hazard: unintended vehicle acceleration
    - Goal: engine power proportional to accel. pedal position
  - Safety strategy: how you plan to achieve goal
    - Example: correct computation AND engine shutdown if unintended acceleration
- Safety Requirements
  - Goals at system level; requirements provide supporting detail
  - Supporting requirements generally allocated to subsystems
    - Might include functionality and fail-safe mitigation requirements
  - Examples
    - Engine torque shall match accelerator position torque curve
    - Pedal/torque mismatch shall result in engine shutdown
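
The two example requirements above can be illustrated with a small sketch. Everything numeric here is an assumption for illustration: the linear torque curve, the 300 Nm maximum, and the 20 Nm tolerance are hypothetical, not values from the lecture or any standard.

```python
MAX_TORQUE_NM = 300.0   # hypothetical maximum engine torque
TOLERANCE_NM = 20.0     # hypothetical allowed pedal/torque mismatch


def expected_torque(pedal_fraction: float) -> float:
    """Hypothetical torque curve: torque proportional to pedal position (0.0 to 1.0)."""
    clamped = min(max(pedal_fraction, 0.0), 1.0)
    return clamped * MAX_TORQUE_NM


def torque_mismatch(pedal_fraction: float, measured_torque_nm: float) -> bool:
    """True if measured torque deviates from the torque curve by more than the tolerance."""
    return abs(measured_torque_nm - expected_torque(pedal_fraction)) > TOLERANCE_NM


def engine_command(pedal_fraction: float, measured_torque_nm: float) -> str:
    """Fail-safe mitigation: a pedal/torque mismatch results in engine shutdown."""
    return "SHUTDOWN" if torque_mismatch(pedal_fraction, measured_torque_nm) else "RUN"


print(engine_command(0.5, 150.0))  # torque matches the curve
print(engine_command(0.0, 200.0))  # pedal released but engine producing torque
```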


### FMEA: Failure Mode Effects Analysis

Idea: Start with component failure; analyze results; identify hazards

| Component    | Potential Failure Mode | Failure Effects               | Recommended Action             | Status |
|--------------|------------------------|-------------------------------|--------------------------------|--------|
| Resistor R2  | Open                   | Open triggers shutdown        | Use industrial spec. component | Done   |
|              | Short                  | Over-current / potential fire | Circuit redesign               | Open   |
| Capacitor C7 | Explodes               | Potential fire                | Select different component     | Open   |

- Significant limitations for generating hazards
  - "Complex component" failures are not well behaved
    - Software fails however it wants to fail
    - Integrated circuits are usually highly coupled internally
  - Poor at representing correlated and accumulated faults
    - E.g., exploding capacitor damaging several nearby components


### **HAZard and Operability Analysis (HAZOP)**

- Structured brainstorming to identify hazards
  - For each system requirement
    - Modify with a guide word
    - Does the result suggest a hazard?
  - Effective starting point, but not guaranteed to find all hazards
- Examples
  - When pressure exceeds 6000 psig, relief valve shall **NOT** actuate.
  - System shall come to a complete stop AFTER (rather than within) 5 seconds when emergency stop is activated.
    - Alternatively: System shall come to a complete stop within 5 seconds LATE when emergency stop is activated.

| Guide Word              | Meaning                                |
|-------------------------|----------------------------------------|
| NO OR NOT               | Complete negation of the design intent |
| MORE                    | Quantitative increase                  |
| LESS                    | Quantitative decrease                  |
| AS WELL AS              | Qualitative modification/increase      |
| PART OF                 | Qualitative modification/decrease      |
| REVERSE                 | Logical opposite of the design intent  |
| OTHER THAN / INSTEAD    | Complete substitution                  |
| EARLY                   | Relative to the clock time             |
| LATE                    | Relative to the clock time             |
| BEFORE                  | Relating to order or sequence          |
| AFTER                   | Relating to order or sequence          |
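
The HAZOP procedure (pair each requirement with each guide word, then ask whether the result suggests a hazard) is mechanical enough to sketch. This only generates brainstorming prompts; judging whether a prompt is a real hazard remains a human activity. The function name and prompt wording are illustrative.

```python
GUIDE_WORDS = [
    "NO OR NOT", "MORE", "LESS", "AS WELL AS", "PART OF", "REVERSE",
    "OTHER THAN / INSTEAD", "EARLY", "LATE", "BEFORE", "AFTER",
]


def hazop_prompts(requirements: list[str]) -> list[tuple[str, str, str]]:
    """Return (requirement, guide word, prompt) for every requirement/guide-word pair."""
    return [
        (req, word, f"[{word}] applied to: {req} -> does this suggest a hazard?")
        for req in requirements
        for word in GUIDE_WORDS
    ]


requirements = [
    "When pressure exceeds 6000 psig, relief valve shall actuate.",
    "System shall come to a complete stop within 5 seconds when emergency stop is activated.",
]

for _, _, prompt in hazop_prompts(requirements):
    print(prompt)
```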


#### Hazards & Risks

- Hazard: a potential source of injury or damage
  - A potential cause of a mishap or loss event (people, property, financial)
- Hazard log
  - Captures hazards for a system
  - HAZOP generates some hazards
  - Others are legacy & experience
- Risk evaluation
  - Risk = Probability \* Consequence
    - Typically determined via a risk table
  - Risk must be reduced to acceptable levels
    - Risk determines required SIL (e.g. "Very High" -> SIL 4)
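
A risk table lookup can be sketched as follows. The table contents are entirely hypothetical: the probability categories, the scoring rule, and every risk-to-SIL mapping except "Very High" -> SIL 4 (which is the example given above) are assumptions made for illustration. A real project takes its risk table from the applicable safety standard.

```python
# Hypothetical category scales, lowest to highest.
PROBABILITY = ["Low", "Medium", "High"]
CONSEQUENCE = ["Minor injuries", "Severe injuries", "Single fatality", "Multiple fatalities"]

# Hypothetical mapping; only "Very High" -> 4 comes from the slide.
SIL_FOR_RISK = {"Low": 1, "Medium": 2, "High": 3, "Very High": 4}


def risk_level(probability: str, consequence: str) -> str:
    """Combine probability and consequence into a risk level (illustrative scoring rule)."""
    score = PROBABILITY.index(probability) + CONSEQUENCE.index(consequence)
    if score >= 5:
        return "Very High"
    if score >= 3:
        return "High"
    if score == 2:
        return "Medium"
    return "Low"


def required_sil(probability: str, consequence: str) -> int:
    """Look up the SIL required to reduce the given risk to an acceptable level."""
    return SIL_FOR_RISK[risk_level(probability, consequence)]


print(risk_level("High", "Multiple fatalities"), required_sil("High", "Multiple fatalities"))
```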

[Figure: risk table (risk as a function of probability and consequence)]

## Safety Analysis & Mitigation

- Failure Mode Effects Analysis (FMEA)
  - Work forward from fault to mishap
- Fault Tree Analysis (FTA)
  - Work backward from hazard to contributing faults
  - Strategy: HAZOP identifies fault tree roots
- Avoid single points of failure
  - If component breaks, is system unsafe?
  - Computational elements fail in worst way
- Life-critical systems require redundancy
  - Also avoid correlated faults
  - High-SIL software techniques to avoid SW defects
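
A fault tree can be evaluated numerically once basic-event probabilities are assigned. The sketch below assumes independent basic events and uses hypothetical per-hour probabilities; it shows why a redundant pair helps (AND gate) and why a shared, correlated fault source can dominate the result (OR gate).

```python
from math import prod


def and_gate(probabilities: list[float]) -> float:
    """All inputs must fail: product of probabilities (assumes independence)."""
    return prod(probabilities)


def or_gate(probabilities: list[float]) -> float:
    """Any input failing is enough: 1 - P(no input fails) (assumes independence)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical per-hour failure probabilities.
p_channel = 1e-5        # one hardware channel
p_shared_power = 1e-6   # power supply shared by both channels

redundant_pair = and_gate([p_channel, p_channel])   # both channels fail together
top_event = or_gate([redundant_pair, p_shared_power])

print(f"redundant pair alone: {redundant_pair:.1e}")
print(f"with shared power supply: {top_event:.1e}")
```

In this hypothetical tree the shared power supply, not the redundant channels, sets the top-event probability, which is the reason correlated faults must be avoided as well.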



Fault Tree


### **Best Practices For Safety Plans**

- A written Safety Plan including
  - Hazards + risks
  - Safety goals + requirements
  - Safety analysis + Mitigation
  - Following a safety standard
  - Resulting in a written safety case
  - Independent audit of safety case
- Pitfalls
  - Software safety usually stems from rigorous SIL engineering
  - FMEA can miss correlated & multipoint faults; FTA must be used as well
  - Need to include safety problems caused by security attacks


### **SAFETY REQUIREMENTS**


### Specifying safety

- Safety goals: "working" is not the same as "safe"
  - How hazards are avoided at system level
  - Can involve correctness, backup systems, failsafes, ...
  - Often what the system does not do is as important as what it does
- Safety requirements
  - More detailed safety-specific requirements allocated to subsystems


### **Identifying Safety-Related Requirements**

- Requirement annotation approach (overly simplistic)
  - Start with system requirements
  - Annotate critical system requirements
  - Then, annotate supporting requirements
  - Problem: Most requirements can become critical
- Too many system components promoted to highest criticality level
  - Allocating even one critical requirement to a component makes the whole thing critical

[Figure: list of system requirements R01-R18, some annotated as critical]


### Safety Envelope Requirements Approach

- Safety Envelope
  - Specify unsafe regions for safety
  - Specify safe regions for functionality
  - Deal with complex boundary via
    - Under-approximate safe region (reduces permissiveness)
    - Over-approximate unsafe region
  - Trigger system safety response upon transition to unsafe region
- Partition the requirements
  - Operation: functional requirements
  - Failsafe: safety requirements (safety functions)
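
A one-dimensional safety envelope can be sketched as an interval with a margin. The margin shrinks the region treated as safe, which is the under-approximation described above: it gives up some permissiveness in exchange for a simple boundary. The class name, limits, and margin are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Safe operating interval, under-approximated by a margin on both sides."""
    low: float
    high: float
    margin: float

    def is_safe(self, value: float) -> bool:
        # Values within `margin` of either limit are treated as unsafe.
        return (self.low + self.margin) <= value <= (self.high - self.margin)


def step(envelope: Envelope, value: float) -> str:
    """Operate normally inside the envelope; trigger the safety response otherwise."""
    return "OPERATE" if envelope.is_safe(value) else "FAILSAFE"


pressure_envelope = Envelope(low=0.0, high=100.0, margin=5.0)  # hypothetical units
print(step(pressure_envelope, 50.0))
print(step(pressure_envelope, 97.0))  # physically below the limit, but inside the margin
```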




### Architecting A Safety Envelope System

- "Doer" subsystem
  - Implements normal functionality
  - Allocate functional requirements to Doer
- "Checker" subsystem
  - Implements failsafes (safety functions)
  - Allocate safety requirements to Checker
- Checker is entirely responsible for safety
  - Doer can be at low SIL (failure is lack of availability)
  - Checker must be at high SIL (failure is unsafe)
    - Often, Checker can be much simpler than Doer
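
A minimal sketch of the Doer/Checker split, assuming a hypothetical proportional controller as the Doer and a simple output-range test as the Checker. The gain, limit, and fail-safe value are illustrative. In a real system the two would run in separate fault containment regions rather than in one program.

```python
COMMAND_LIMIT = 10.0   # hypothetical safe actuator range: [-10, +10]
FAIL_SAFE_COMMAND = 0.0


def doer(setpoint: float, measured: float) -> float:
    """Normal functionality (low SIL): hypothetical proportional controller."""
    return 2.0 * (setpoint - measured)


def checker(command: float) -> bool:
    """Safety function (high SIL): accept only commands inside the safe range."""
    return -COMMAND_LIMIT <= command <= COMMAND_LIMIT


def doer_checker(setpoint: float, measured: float) -> tuple[float, bool]:
    """Issue the Doer's command only if the Checker accepts it; otherwise fail safe."""
    command = doer(setpoint, measured)
    if checker(command):
        return command, True
    return FAIL_SAFE_COMMAND, False


print(doer_checker(5.0, 3.0))    # command accepted
print(doer_checker(100.0, 0.0))  # command rejected, fail-safe output issued
```

The Checker here is a single comparison, which illustrates why it can often be validated far more easily than the Doer.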




### Safety Requirements Best Practices

- Doer/Checker pattern
  - Functional requirements allocated to low-SIL Doer
  - Safety requirements allocated to high-SIL Checker
- Good safety requirements
  - Trace to system-level safety goals
  - Orthogonal to normal functional operation if possible
  - Make safety simple to validate (test, peer review)
    - Safety testing mostly exercises the Checker box
- Pitfalls
  - Tradeoff between simplicity and permissiveness
    - Doer optimality costs Checker validation effort
  - Fail-operational functions may require multiple Doer/Checker pairs


#### SINGLE POINTS OF FAILURE


## Avoid Single Points of Failure

- Fault Containment Region (FCR)
  - Faults from outside FCR are kept out
  - Faults inside FCR are kept in
  - But, within FCR a single fault has <u>arbitrarily bad</u> effects
    - It's like a blast inside the FCR
    - Applies to both SW faults and HW faults (e.g., single event upsets)




## Eliminating Single Points of Failure

- Multiple FCRs required for life-critical and highly mission-critical systems
  - This isolates faults in redundant components – no single point of failure
  - Avoid an Achilles' Heel in your system
    - All software on CPU can be a "single point"
- Multi-channel (e.g., 2 of 2)
  - Compare identical component outputs
- Doer/Checker (monitor/actuator pair)
  - "Checker" makes sure "Doer" is safe
- Safety gate
  - Only permits safe outputs to issue
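
The multi-channel (2 of 2) idea can be sketched as a comparison of two channel outputs, with an output issued only on agreement. The function name and tolerance parameter are illustrative, and in practice the two channels would be separate FCRs (typically separate chips), not two calls in one program.

```python
from typing import Optional


def two_of_two(channel_a: float, channel_b: float,
               tolerance: float = 0.0) -> tuple[Optional[float], bool]:
    """Compare two redundant channel outputs.

    Returns (output, ok). An output is issued only if both channels agree
    within `tolerance`; on disagreement no output is issued, so the system
    can fall back to its safe state.
    """
    if abs(channel_a - channel_b) <= tolerance:
        return channel_a, True
    return None, False


print(two_of_two(1.0, 1.0))   # channels agree
print(two_of_two(1.0, 2.0))   # channels disagree: no output issued
```

A 2-of-2 arrangement detects a fault in either channel but cannot tell which channel is wrong, so it stops rather than continues.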




#### Correlated & Accumulated Faults

- <u>Correlated</u> faults: multiple FCRs are likely to fail together
  - Common design faults (including software)
  - Common manufacturing faults
  - Shared infrastructure (e.g., power, clock)
  - Physical coupling
    - Shared wiring harness, connectors
    - Shared location (e.g., hot spot)
- Accumulated faults
  - Fault not detected
  - Fault not repaired before next mission


# Best Practices To Avoid Single Points of Failure

- Safety is improved by using multiple FCRs
  - Hardware redundancy / HW isolation
    - Typically each FCR should be an independent chip
  - Software must be practically "perfect"
  - Common patterns: multi-channel, checker, safety gate



- Pitfalls
  - Two copies of the same SW fail the same way
  - Ensure multi-channel doesn't fail as "always trust one channel"
  - Ensure the checker doesn't fail as "always checks OK"
  - Look for hidden correlation (HW design defects, shared libraries, shared requirement defects, physical connection, shared clock, shared power, ...)




#### CRITICAL SYSTEM ISOLATION


## Critical System Isolation

#### Need isolation between different SILs

- Lower-SIL software is assumed to compromise higher-SIL software
  - Higher SIL -> "trusted" (critical tasks)
  - Lower SIL -> "untrusted" (non-critical tasks)
    - Can corrupt high-SIL data values, timing, configuration
- Hardware isolation is best option
  - Different SILs separated on different chips
  - Different networks for safety vs. nonsafety data
    - Network data exchange is safety critical




#### Mixed-SIL Interference Examples

- Memory value interference
  - Non-critical task modifies critical variables
  - Non-critical ISR causes critical task stack overflow
  - Non-critical task memory leak; heap exhaustion
- CPU time interference
  - Non-critical task runs at high priority; starves critical tasks
  - Non-critical task disables interrupts, delaying critical tasks
- Watchdog timer
  - Non-critical task kicks watchdog regularly
  - Non-critical task disables watchdog
- System configuration
  - Non-critical task changes digital output to input
- Network
  - Non-critical node sends unsafe critical message

[Figure: critical task (high SIL) and non-critical task (low SIL) sharing memory, watchdog timer, configuration data, and network]


#### Mitigating Cross-SIL Interference

- Develop all software at highest SIL
  - Avoids isolation, but increases expense and ...
- Hardware solution - separate CPU chips
  - Multi-core provides only partial isolation
- High-SIL RTOS approaches
  - Hardware memory protection (MMU)
  - Hardware CPU time isolation (e.g., multi-...
  - Virtualization of I/O and configuration
- Other techniques can help for Low-SIL
  - Variable mirroring (two one's complement copies)
  - Critical tasks run at high priorities or in ISRs
  - Non-modifiable watchdog timer configuration
- Self-test is insufficient for High-SIL integrity
  - Fault in high SIL hardware can subvert self-test
  - Single CPU at SIL 3 or SIL 4

[Figure: multicore processor with CPU core #1 and #2, L1 caches, L2 cache, bus interface, and memory & I/O, marking possible interference points]
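
Variable mirroring (two one's complement copies) can be sketched as follows: a critical value is stored together with its bitwise inverse, and every read checks that the two copies are still complements. The class name and the 32-bit width are assumptions for illustration; on a real target this would normally be written in C with the copies placed in separate memory regions.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit storage width


class MirroredU32:
    """Critical 32-bit value stored with a one's complement mirror copy."""

    def __init__(self, value: int) -> None:
        self.write(value)

    def write(self, value: int) -> None:
        self.value = value & MASK32
        self.inverse = ~value & MASK32  # one's complement copy

    def read(self) -> int:
        # The two copies must XOR to all ones; anything else means corruption.
        if (self.value ^ self.inverse) != MASK32:
            raise ValueError("mirror mismatch: critical variable corrupted")
        return self.value


speed_limit = MirroredU32(42)
print(speed_limit.read())

speed_limit.value ^= 0x1  # simulate a stray write flipping one bit
try:
    speed_limit.read()
except ValueError as err:
    print(err)
```

This detects corruption of either copy but does not prevent it, which is consistent with the slide listing it as a technique that helps at low SIL rather than a substitute for hardware isolation.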

## **Isolation and Security**

- A lower-SIL task is approximately a malicious attacker
  - How can it disrupt higher-SIL software?
  - Consider: memory corruption, timing, configuration, network
- Implications for safety
  - A weaker fault model means making assumptions
  - Lower-SIL update means revisiting assumptions
- Implications for security
  - Higher-SIL functions more resistant to attack if isolated
  - Bad pattern: everything on one CPU with desktop OS
  - Better pattern: isolated CPUs with high-SIL critical RTOS


### **Best Practices For Critical System Isolation**

- Use as much hardware isolation as you can
  - Consider
    - Data value isolation
    - CPU time isolation
    - Configuration corruption
    - Shared resource isolation
  - Applies to any different SILs
    - Crucial for non-SIL <-> SIL 3/4
- Pitfalls
  - IEC 60730: arguing that low-SIL software won't interfere requires re-arguing after every low-SIL change


